The Hungarian Gigaword Corpus
نویسندگان
چکیده
The paper reports on the development of the Hungarian Gigaword Corpus, an extended new edition of the Hungarian National Corpus, with upgraded and redesigned linguistic annotation and an increased size of 1.5 billion tokens. Issues concerning the standard steps of corpus collection and preparation are discussed with special emphasis on linguistic analysis and annotation due to Hungarian having some challenging characteristics with respect to computational processing.
منابع مشابه
Creating Open Language Resources for Hungarian
The paper provides an overview of the open source Hungarian language resources that the SzóSzablya ‘WordSword’ project is creating. An extensive crawl of the .hu domain yielded a raw dataset of over 18m web pages. We discuss the methods used to detect and remove duplicates, low quality, foreign, and mixed language documents, and describe the resulting gigaword corpus and various frequency count...
متن کاملAutomatic Acquisition of Linguistic Knowledge: From Sinica Corpus to Gigaword Corpus
The raison d’etre for a corpus, as it was first conceived by Francis and Kucera in 1963, was to provide a body of linguistic facts from which linguistic knowledge could be generalized, [1]. The methods of acquisition have evolved as corpus size and technology have advanced in the past 40 years. Originally corpus-based concordances assisted linguists to form generalizations. This was what Fillmo...
متن کاملWord Usage : Newspaper Text versus the Web
This paper explores the differences in words and word usage in two corpora – one derived from newspaper text and the other from the web. A corpus of web pages is compiled from a controlled traversal of the web, producing a topicdiverse collection of 2 billion words of web text1. We compare this Web Corpus with the Gigaword Corpus, a 2 billion word collection of news articles. The Web Corpus is ...
متن کاملUsing Chinese Gigaword Corpus and Chinese Word Sketch in linguistic Research
We explore the possibility of deeper linguistic research based on corpus and computational linguistic tools in this paper. In particular, we adopt Chinese Word Sketch, the application of Word Sketch Engine to Chinese GigaWord Corpus, for linguistic research. We apply Chinese Sketch Engine results to deeper linguistic account such as selectional restriction and event type selection. The study is...
متن کاملWords and Word Usage: Newspaper Text versus the Web
This paper explores the differences in words and word usage in two corpora – one derived from newspaper text and the other from the web. A corpus of web pages is compiled from a controlled traversal of the web, producing a topicdiverse collection of 2 billion words of web text1. We compare this Web Corpus with the Gigaword Corpus, a 2 billion word collection of news articles. The Web Corpus is ...
متن کامل